strobealign-aemb module for metagenomic binning #394

psj1997 · 2024-02-13T06:30:57Z

Hi,

This is the aemb (abundance estimation for metagenomic binning) module, the usage is :

strobealign ref.fa read1.fq read2.fq --aemb > abun.txt

In my benchmarking, aemb can output similar results, but decrease more than 90% running time.

Sincerely
Shaojun

ksahlin · 2024-02-14T14:51:52Z

Thanks Shaojun. I think Marcel is a bit busy right now but I am sure he will have a look when there is sufficient time.

marcelm

Hi, here’s a partial review. Some additional comments:

Tests are needed
Documentation needs to be added (including a description of the file format).
Please ensure that --aemb and -x cannot be used at the same time.

src/aln.cpp

src/main.cpp

src/aln.cpp

marcelm · 2024-02-16T13:14:41Z

src/aln.cpp

+    std::vector<float> &abundance, 
+    int read1_len, 
+    int read2_len,
+    bool output_abundance      


Instead of modifying this function, I believe it would be better to just add a new function, maybe named get_abundances or so. You can copy the get_best_map_location function and adjust it as necessary. That will add some duplication, but that can be fixed later.

Please make it so that the function returns the abundances.

Please make it so that the function returns the abundances.

I take that back; that would be less efficient. It is ok to have this as a mutable in/out argument.

src/cmdline.cpp

marcelm

Here are some more comments. Also, this needs documentation and tests.

marcelm · 2024-02-20T09:20:07Z

src/aln.cpp

+    std::vector<float> &abundance, 
+    int read1_len, 
+    int read2_len,
+    bool output_abundance      


Please make it so that the function returns the abundances.

I take that back; that would be less efficient. It is ok to have this as a mutable in/out argument.

src/aln.cpp

src/main.cpp

src/aln.cpp

This way, the strobealign sources are more self-contained and it should not be necessary to have internet access at build time.

See https://github.com/alugowski/poolSTL Sorting the randstrobes is currently a bottleneck in index generation as it does not run in parallel. This is an attempt to parallelize it. poolstl’s sort uses regular std::sort under the hood. We currently use pdqsort_branchless, which is about twice as fast as std::sort, so parallel sorting breaks even with pdqsort_branchless at about 2-3 threads. It gets faster with more threads, but not as much as one would perhaps expect. Here are the sorting runtimes for CHM13: - 31 s with pdqsort_branchless - 59 s with std::sort - 34 s with parallel sort, 2 threads - 24 s with parallel sort, 4 threads - 23 s with parallel sort, 8 threads Another issue is that sorting is no longer in place, so memory usage goes up by a couple of gigabytes, which is another reason for me not to make this change.

This version indeed gives us a very nice speed improvement without using extra memory: Sorting-only runtimes: * 1 thread: 32.8 s * 2 threads: 20.3 s * 4 threads: 15.7 s * 8 threads: 14.8 s Overall indexing runtimes (before/after): * 1 thread: 151 s → 153 s * 2 threads: 100 s → 88 s * 4 threads: 73 s → 57 s * 8 threads: 63 s → 47 s

@alugowski

... by sorting them also by position. This way, it does not matter in which way the randstrobes vector is partitioned during sort. Otherwise, the order would depend on the number of threads used to create the index, making mapping results not reproducible across runs that do not use the same no. of threads. Note: There is still a tiny chance for collisions/nondeterminism because we ignore RefRandstrobe::m_packed. For efficiency, this uses a branchless comparison function inspired by @alugowski’s comment in PR ksahlin#386. Runtimes for sorting with one thread 32 s - only by hash 45 s - by hash and position using std::tie 42 s - by hash and position using branchless_compare from the PR 35 s - by hash and position using __uint128_t (this commit) Runtimes for sorting (four cores with hyperthreading) threads | sorting time | index creation time -|-|- 1 | 35 s | 154 s 2 | 25 s | 95 s 4 | 16 s | 60 s 8 | 15 s | 48 s

Closes ksahlin#395

Right now, strobealign only supports up to 2²⁴ sequences. If the user tries more, it would silently accept it, but later crash. This was triggered when trying to map to the Greengenes database https://ftp.microbio.me/greengenes_release/2022.10/ https://ftp.microbio.me/greengenes_release/2022.10/2022.10.seqs.fna.gz Even a single read like the one below trigger a crash ``` @M05314:127:000000000-BWLLJ:1:1101:15267:1654 2:N:0:1 CCTGTTCGCTCCCCACGCTTTCGTCCCTCAGCGTCAATATTGTGCCAGAATGCTGCCTTCGCCATTGGTGTTCCTCCTGATATCTACGCATGTCACCGCTACACCAGGAATTCCACATTCCTCTCACATATTCTATTTTATCAGTTTTGAT + AAA1AF@1>AAAGG1A0EAFGGEHAAEGFCG1AAEE/F2FG2F2FF1CA0FBDED1BGFGFFE?AF1BFFCFHDGFFHB1FFGFGEEFE/?/BF2F@/EGEEB00/0//0BFG1>B1BGFEFHHGGFFD12BGH2FDFFFGG22GDD>@/F ```

This is 3-state variable: SAM/PAF/Abundance

RFCT aemb

marcelm

Just two more tiny comments. Also, I will probably rearrange the documentation a little bit (clarifying the different output modes a bit) after this has been merged, but it’s probably easiest if it’s not part of this PR. I’ll also add a changelog entry later.

marcelm · 2024-02-21T09:46:34Z

src/aln.cpp

+                                        std::make_pair(std::cref(nams2), read2_len) }) {
+            size_t best_score = 0;
+            // We loop twice because we need to count the number of NAMs with best score
+            for (auto &t : nams) {


Can you please rename t to nam?

(Also below in align_or_map_single)

marcelm · 2024-02-21T09:52:40Z

src/main.cpp

@@ -105,6 +105,12 @@ InputBuffer get_input_buffer(const CommandLineOptions& opt) {
    }
 }

+void output_abundance(std::vector<double> abundances, References references){


Both parameters should be passed by const reference

src/main.cpp

Co-authored-by: Marcel Martin <[email protected]>

marcelm · 2024-03-04T09:45:37Z

This looks good now! I think this deserves a slightly longer section in the documentation, but I’ll do that after merging.

Update aemb module

a4a22c4

marcelm reviewed Feb 16, 2024

View reviewed changes

marcelm reviewed Feb 20, 2024

View reviewed changes

psj1997 and others added 26 commits February 20, 2024 22:48

Code improve; documentation update; testing update

f113da2

Merge branch 'ksahlin:main' into main

3ced054

Update single-end abundance testing file

c09d367

Update paired-end abundance testing file

0719ec5

Vendor zstr instead of fetching it from Git at build time

7418bb6

This way, the strobealign sources are more self-contained and it should not be necessary to have internet access at build time.

Bump poolSTL and use poolstl::pluggable_sort

c689d89

Update baseline commit

1c63767

Bump poolSTL to 0.3.5

6e07fcc

Update baseline commit

3540334

Update baseline commit

f1de1e2

Introduce canonical read length 75

dad60c8

Closes ksahlin#395

Update aemb module

9a79d89

Code improve; documentation update; testing update

358a5b2

Update single-end abundance testing file

cec26e8

Update paired-end abundance testing file

7be8e2c

RFCT Simplify code by merging duplications

47770ee

RFCT Simplify output_format handling

677edcc

This is 3-state variable: SAM/PAF/Abundance

RFCT Merge duplicated code

ff7222a

Merge branch 'main' into rfct_aemb

0948cd1

Merge pull request #3 from luispedro/rfct_aemb

782fed8

RFCT aemb

BUG fix of RFCT aemb

372b868

Remove unused vector in single-end mode

383515d

marcelm reviewed Feb 21, 2024

View reviewed changes

Minor fix

8c32350

marcelm reviewed Feb 26, 2024

View reviewed changes

src/main.cpp Outdated Show resolved Hide resolved

Update src/main.cpp

5a1ab9e

Co-authored-by: Marcel Martin <[email protected]>

marcelm merged commit bd183f5 into ksahlin:main Mar 4, 2024
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

strobealign-aemb module for metagenomic binning #394

strobealign-aemb module for metagenomic binning #394

psj1997 commented Feb 13, 2024

ksahlin commented Feb 14, 2024

marcelm left a comment

marcelm Feb 16, 2024

marcelm Feb 20, 2024

marcelm left a comment

marcelm Feb 20, 2024

marcelm left a comment

marcelm Feb 21, 2024

marcelm Feb 21, 2024

marcelm commented Mar 4, 2024

strobealign-aemb module for metagenomic binning #394

strobealign-aemb module for metagenomic binning #394

Conversation

psj1997 commented Feb 13, 2024

ksahlin commented Feb 14, 2024

marcelm left a comment

Choose a reason for hiding this comment

marcelm Feb 16, 2024

Choose a reason for hiding this comment

marcelm Feb 20, 2024

Choose a reason for hiding this comment

marcelm left a comment

Choose a reason for hiding this comment

marcelm Feb 20, 2024

Choose a reason for hiding this comment

marcelm left a comment

Choose a reason for hiding this comment

marcelm Feb 21, 2024

Choose a reason for hiding this comment

marcelm Feb 21, 2024

Choose a reason for hiding this comment

marcelm commented Mar 4, 2024